Abstract. We investigate a population of binary mistake sequences that result from learning with parametric models of different order. We obtain estimates of their error, algorithmic complexity and divergence from a purely random Bernoulli sequence. We study the relationship of these variables to the learner's information density parameter, which is defined as the ratio between the lengths of the compressed and uncompressed files that contain the learner's decision rule. The results indicate that good learners have a low information density $\rho$ while bad learners have a high $\rho$. Bad learners generate mistake sequences that are atypically complex or diverge stochastically from a purely random Bernoulli sequence. Good learners generate typically complex sequences with low divergence from Bernoulli sequences and they include mistake sequences generated by the Bayes optimal predictor. Based on the static algorithmic interference model of [18], the learner here acts as a static structure which "scatters" the bits of an input sequence (to be predicted) in proportion to its information density $\rho$, thereby deforming its randomness characteristics.

1. Overview

Ratsaby [18] introduced a quantitative definition of the information content of a general static system (e.g. a solid or some fixed structure) and explained how it algorithmically interferes with input excitations, thereby influencing its stability. His model is based on concepts of the theory of algorithmic information and randomness. He modeled a system as a selection rule of finite algorithmic complexity which acts on an incoming sequence of random external excitations by selecting a subsequence as output. As postulated in [18], a simple structure is one whose information content is small. Its selection behavior is of low complexity since it can be more concisely described. Consequently it is less able to deform properties of randomness of the input sequence. Conversely, if the system is sufficiently complex it can significantly deform the randomness at the input. Following [18] there have been recent theoretical and empirical results that validate his model for specific problem domains. The first empirical validation of his model appeared in [25], [26], [7], where it was shown that this inverse relationship between system complexity and randomness exists also in a real physical system. The particular system investigated consisted of a one-dimensional vibrating solid beam to which a random sequence of external input forces is applied. In [19, 21] the problem of learning to predict binary sequences was shown to be an exemplar of this paradigm. The complexity of a learner's decision rule is proportional to the amount that the subsequence selected by the learner (via his mistakes) deviates from a truly random sequence. A first empirical investigation of this learning problem appeared in [23], [?], [21], where a new measure of system complexity called the sysRatio was introduced and shown to be a proper measure of a learner's decision complexity.

The current paper continues along this line. It provides further empirical analysis and justification of the model of [18] applied to the problem of learning, and it also gives new interpretations of standard learning phenomena such as underfitting or overfitting a model to data. It is shown that these phenomena can be interpreted as certain types of deformations of randomness of the binary mistake sequences. These deformations are measured in the $\Delta\ell$-plane ($\Delta$ stands for divergence and $\ell$ for estimated Kolmogorov complexity). We conclude that the prediction rule obtained by learning is analogous to a physical static object that scatters a random beam of particles. We call this phenomenon bit-scattering (we discuss it at the end of section 5). The current paper is a further justification that the static algorithmic interference model defined in [18] applies to the problem of learning to predict. Before introducing the main concepts, let us state the problem that we consider in the paper.

Statement of the problem: A random source generates two binary sequences, $x^{(m)}$ and $x^{(n)}$, of length m and n, respectively, according to a finite Markov chain of unknown order $k^*$ with an unknown probability transition matrix. A learner uses $x^{(m)}$ to estimate the probability parameters of a Markov model of order k. Once the model is learnt, the learner makes a prediction for every bit in $x^{(n)}$. Denote by $y^{(n)}$ the binary sequence corresponding to these predictions. Denote by $\xi^{(n)}$ the error sequence that corresponds to the learner's predictions, where the $i$th bit $\xi_i = 1$ if the prediction differs from the true value, i.e., $y_i \neq x_i$, and $\xi_i = 0$ otherwise. Denote by $\xi_0^{(n)}$ the subsequence of $\xi^{(n)}$ corresponding to those bits of $y^{(n)}$ that are 0. In this paper we study different characteristics of the error sequence $\xi_0^{(n)}$ and how they depend on the learner's two main parameters, the training sequence length m and the model order k. We focus on two main characteristics, the algorithmic complexity of the error sequence and the statistical deviation between the frequency of 1s and the probability of seeing a 1 in the sequence. We determine their interrelationship and how the probability of a prediction error depends on them.

The remainder of the paper is organized as follows: in section 2 we introduce the basic concepts of algorithmic complexity and related properties of randomness. In section 3 we review the concept of a selection rule, and in section 4 we state a relationship between the complexity of a finite random binary sequence and its entropy. Section 5 describes the experimental setup used for the analysis, followed by section 6 which describes the results.

Before continuing, we should clarify at this point that our use of the words 
'chaoticity' or 'chaotic' is different from chaos theory. By a chaotic binary sequence 
we do not necessarily mean that it is generated by some dynamical system that 
is highly sensitive to initial conditions but that it is highly disordered, or in other 
words, has a high algorithmic complexity. 

2. Introduction 

Algorithmic randomness (see [6], [12], [5]) is a notion of randomness of an individual element (object) of a sample space. It reflects how chaotic the object is, or how complicated it is to describe. Classical probability theory assigns probabilities to sets of outcomes of random trials in an experiment. For instance, consider an experiment with n randomly and independently drawn binary numbers $X_i$, $i = 1, \ldots, n$, where $X_i = 1$ with probability 1/2. Then any outcome such as $X = (0, 0, \ldots, 0)$ has the same probability $2^{-n}$. However, from an algorithmic perspective, it is clear that the string $X = (0, 0, \ldots, 0)$ is not random compared to some other possible string with a more complicated pattern of zeros and ones. Algorithmic randomness of finite objects (binary sequences) aims to explain the intuitive idea that a sequence, whether finite or infinite, should be measured as being more unpredictable if it possesses fewer regularities (patterns). There is no formal definition of randomness but there are three main properties that a random binary string of length n must intuitively satisfy [28]. The first property is the so-called stochasticity or frequency stability of the sequence, which means that any binary word of length $k \le n$ must have the same frequency limit (equal to $2^{-k}$). This is basically the notion of normality that Borel introduced and is related to the degree of unpredictability of the sequence. The second property is chaoticity or disorderliness of the sequence. A sequence is less chaotic (less complex) if it has a short description, i.e., if the minimal length of a program that generates the sequence is short. The third property is typicalness. A random sequence is a typical representative of the class $\Omega$ of all binary sequences. It has no specific features distinguishing it from the rest of the population. An infinite binary sequence is typical if each small subset E of $\Omega$ does not contain it (the correct definition of a 'small' set was given by Martin-Löf [16]).

Algorithmic randomness was first considered by von Mises in 1919, who defined an infinite binary sequence $\alpha$ of zeros and ones as random if it is unbiased, i.e. if the frequency of zeros goes to 1/2, and every subsequence of $\alpha$ that we can extract using an admissible selection rule (see definition below) is also unbiased. Kolmogorov and Loveland [15], [14] proposed a more permissive definition of an admissible selection rule as any (partial) computable process which, having read any n bits of an infinite binary sequence $\alpha$, picks a bit that has not been read yet, decides whether it should be selected or not, and then reads its value. When subsequences selected by such a selection rule pass the unbiasedness test they are called Kolmogorov-Loveland stochastic (KL-stochastic for short). Martin-Löf [16] introduced a notion of randomness which is now considered by many as the most satisfactory notion of algorithmic randomness. His definition says precisely which infinite binary sequences are random and which are not. The definition is probabilistically convincing in that it requires each random sequence to pass every algorithmically implementable statistical test of randomness.

In this paper we are concerned with random sequences that arise from the process of learning and prediction, or more specifically, from the prediction mistakes made by a learner. Let $X^{(n)} = X_1, \ldots, X_n$ be a sequence of binary random variables drawn according to some unknown joint probability distribution $P(X^{(n)})$. Consider the problem of learning to predict the next bit in a binary sequence drawn according to P. For training, the learner is given a finite sequence $x^{(m)}$ of bits $x_t \in \{0, 1\}$, $1 \le t \le m$, drawn according to P and estimates a model $\mathcal{M}$ that can be used to predict the next bit of a partially observed sequence. After training, the learner is tested on another sequence $x^{(n)}$ drawn according to the same unknown distribution P. Using $\mathcal{M}$ he produces the bit $y_t$ as a prediction for $x_t$, $1 \le t \le n$. Denote by $\xi^{(n)}$ the corresponding binary sequence of mistakes, where $\xi_t = 1$ if $y_t \neq x_t$ and is 0 otherwise. Denote by $\xi_0^{(n)}$ the subsequence of $\xi^{(n)}$ that corresponds to the times t where the learner predicted $y_t = 0$. Note that $\xi_0^{(n)}$ is also a subsequence of $x^{(n)}$ (whenever the learner predicts 0, a mistake occurs exactly when $x_t = 1$, so $\xi_t = x_t$ at those times), so we can view the process of predicting as a process of selecting a subsequence of the input $x^{(n)}$.

It is clear that the subsequence $\xi_0^{(n)}$ of mistakes should be random since the test sequence $x^{(n)}$ is random. It is reasonable to expect that the learner may implicitly vary some of the randomness characteristics of the subsequence of bits that he selects, thereby causing $\xi_0^{(n)}$ to be less random than $x^{(n)}$. In this sense, we may say that the learner 'deforms' the randomness of the input $x^{(n)}$, producing a less random subsequence of $x^{(n)}$. Or perhaps the learner, being of finite complexity, is limited in his ability to 'deform' the randomness of $x^{(n)}$. Essentially we ask what 'interference' a learner has on the randomness of a test sequence. It appears essential that we look not only at the randomness of the object itself (the test sequence $x^{(n)}$) but also at the interfering entity, the learner, and specifically at its algorithmic component that is used for prediction.

3. Selection rule 

Let us formally define a selection rule. This is a principal concept used as part of tests of randomness of sequences (mentioned above). Let $\{0, 1\}^*$ be the space of all finite binary sequences and denote by $\{0, 1\}^n$ the set of all finite binary sequences of length n. An admissible selection rule R is defined [14] based on three partial recursive functions f, g and h on $\{0, 1\}^*$. Let $x^{(n)} = x_1, \ldots, x_n$. The process of selection is recursive. It begins with an empty sequence $\emptyset$. The function f is responsible for selecting possible candidate bits of $x^{(n)}$ as elements of the subsequence to be formed. The function g examines the value of these bits and decides whether to include them in the subsequence. Thus f does so according to the following definition: $f(\emptyset) = i_1$, and if at the current time k a subsequence has already been selected which consists of elements $x_{i_1}, \ldots, x_{i_k}$, then f computes the index of the next element to be examined according to $f(x_{i_1}, \ldots, x_{i_k}) = i$ where $i \notin \{i_1, \ldots, i_k\}$, i.e., the next element to be examined must not be one which has already been selected (notice that possibly $i < i_j$ for some $1 \le j \le k$, i.e., the selection rule can go backwards on x). Next, the two-valued function g selects this element $x_i$ to be the next element of the constructed subsequence of x if and only if $g(x_{i_1}, \ldots, x_{i_k}) = 1$. The role of the two-valued function h is to decide when this process must be terminated. This subsequence selection process terminates if $h(x_{i_1}, \ldots, x_{i_k}) = 1$ or $f(x_{i_1}, \ldots, x_{i_k}) > n$. Let $R(x^{(n)})$ denote the selected subsequence. By $K(R|n)$ we mean the length of the shortest program computing the values of f, g and h given n.
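
To make the roles of f, g and h concrete, the following is a minimal Python sketch of the selection process. It is an illustration only and not code from the paper: the function name `select_subsequence` is hypothetical, indices are 0-based (the text uses 1-based indices), all three functions are assumed to receive the tuple of bit values examined so far, and an index returned by f that is out of range or already examined is treated as termination.

```python
def select_subsequence(x, f, g, h):
    """Apply the selection rule R = (f, g, h) to the sequence x and return R(x).

    f(examined) -> index of the next bit to examine
    g(examined) -> 1 if that next bit should be included in the output
    h(examined) -> 1 if the process should stop
    """
    n = len(x)
    examined_idx = set()   # indices i_1, ..., i_k examined so far
    examined = ()          # their values x_{i_1}, ..., x_{i_k}
    selected = []          # the subsequence R(x) built so far
    while len(examined_idx) < n:
        if h(examined) == 1:
            break
        i = f(examined)
        # Assumption: an invalid or repeated index ends the process.
        if not 0 <= i < n or i in examined_idx:
            break
        # g decides from the previously examined values only,
        # i.e. before the value x[i] itself is read.
        if g(examined) == 1:
            selected.append(x[i])
        examined_idx.add(i)
        examined = examined + (x[i],)
    return selected


# Example rule: scan left to right and select every bit that follows a 1.
scan = lambda seen: len(seen)
after_one = lambda seen: 1 if seen and seen[-1] == 1 else 0
never_stop = lambda seen: 0

print(select_subsequence([1, 0, 1, 1, 0, 0, 1, 0], scan, after_one, never_stop))
```

In this example the rule is very simple (its description is short), which in the sense of the text corresponds to a low $K(R|n)$.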

From the above discussion, we know that there are two principal measures related to the information content in a finite sequence: stochasticity (unpredictability) and chaoticity (complexity). An infinitely long binary sequence is regarded as random if it satisfies the principle of stability of the frequency of 1s for any of its subsequences that are obtained by an admissible selection rule [15]. Kolmogorov showed that the stochasticity of a finite binary sequence x may be precisely expressed by the deviation of the frequency of ones from some $0 < p < 1$, for any subsequence of $x^{(n)}$ selected by an admissible selection rule R of finite complexity $K(R|n)$, where for an object x given another object y he defined in [13] the complexity of x as

$$K(x|y) = \min\{ l(\pi) : \phi(\pi, y) = x \} \qquad (3.1)$$

where $l(\pi)$ is the length of the sequence $\pi$ and $\phi$ is a universal partial recursive function which acts as a description method, i.e., when provided with input $(\pi, y)$ it gives a specification for x (for an introduction see section 2 of [26]). The chaoticity of $x^{(n)}$ is large if its complexity is close to its length n. The classical work of [2], [13], [11] relates chaoticity to stochasticity. In [5], [15] it is shown that chaoticity implies stochasticity. For a binary sequence s, let us denote by $\|s\|$ the number of 1s in s; then this can be seen from the following relationship (with p = 1/2):



$$\left| \frac{\|R(x^{(n)})\|}{l(R(x^{(n)}))} - \frac{1}{2} \right| \le c \sqrt{\frac{n - K(x^{(n)}|n) + K(R|n) + 2\log K(R|n)}{l(R(x^{(n)}))}} \qquad (3.2)$$



where $l(R(x^{(n)}))$ is the length of the subsequence selected by R and $c > 0$ is some absolute constant. Evidently, as the chaoticity of $x^{(n)}$ grows, the stochasticity of the selected subsequence $R(x^{(n)})$ grows (the bias from 1/2 decreases). Also, and more relevant to the context of this paper, the information content of the selection rule, namely $K(R|n)$, has a direct effect on this relationship: the lower $K(R|n)$ the stronger the stability (smaller deviation of the frequency of 1s from 1/2). In [11] the other direction, that stochasticity implies chaoticity, is proved.

It was recently shown in [19, 21] that the level of randomness of the subsequence $\xi_0^{(n)}$ of $\xi^{(n)}$, which corresponds to the occurrences of mistakes in predicting 0s, decreases as the complexity of the learner increases. The approach taken there is to represent the learner's decision as a selection rule that selects $\xi_0^{(n)}$ from $\xi^{(n)}$. The rule's complexity is defined based on a combinatorial quantity rather than Kolmogorov complexity but still yields a relationship of the form of (3.2). This relationship shows that the possible deviation of the frequency of 1s in $\xi_0^{(n)}$ from the probability $p_0$ of seeing a 1 in $\xi_0^{(n)}$ grows as the complexity of the class of possible decisions grows.

The current paper investigates this experimentally. We consider a learner's prediction (or decision) rule, which we term the system, and study its influence on a random binary test sequence on which prediction decisions are made. The system is based on the maximum a posteriori probability decision where probabilities are defined by a statistical parametric model which is estimated from data. The learner of this model is a computer program that trains on a given random data sequence and then produces a decision rule by which it is able to predict (or decide) the value of the next bit in future (yet unseen) random binary sequences. As in [19, 21] we focus on a Markov source and a Markov learner whose orders may differ.



4. Relationship to information theory 

We now describe the connection between the concepts of entropy (Shannon entropy) and algorithmic complexity. Entropy is a measure of unpredictability of a random variable. Intuitively, we expect that the more unpredictable a sequence of random variables, the higher its algorithmic (Kolmogorov) complexity. This is formally expressed as Theorem 14.3.1 in [9], which we now state: denote by $H(X_i)$ the entropy of a random variable $X_i$ and consider a sequence of random variables $\{X_i\}$ drawn i.i.d. according to the probability mass function $f(x)$, $x \in \mathcal{X}$, where $\mathcal{X}$ is a finite alphabet. Let $f(x^{(n)}) = \prod_{i=1}^{n} f(x_i)$. Then there exists a constant c such that

$$H(X_i) \le \frac{1}{n} \sum_{x^{(n)}} f(x^{(n)}) K(x^{(n)}|n) \le H(X_i) + \frac{(|\mathcal{X}| - 1)\log n}{n} + \frac{c}{n}$$

for all n. Consequently, the expected value $E\left[\frac{1}{n} K(X^{(n)}|n)\right] \to H(X_i)$ with increasing n. This means that the expected Kolmogorov complexity per symbol of the sequence converges to the Shannon entropy of a single symbol with increasing n.

A more relevant estimate for our work here concerns the Kolmogorov complexity of a specific sequence (not the expected value over all sequences). In the case of a Bernoulli random sequence $\{X_i\}_{i=1}^{n}$ with probability $p = P(X_i = 1)$, its complexity relates to the binary entropy $H(X_i)$ of any of the i.i.d. random variables of the sequence. It is based on the following statement which holds even more generally for any binary sequence of length n (Theorem 14.2.5 of [9]): Let $x^{(n)} = x_1, x_2, \ldots, x_n$ be a binary string; then the Kolmogorov complexity of $x^{(n)}$ is bounded as

$$K(x^{(n)}|n) \le nH(\bar{p}) + \frac{1}{2}\log_2 n + c \qquad (4.1)$$

where $\bar{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $H(p) = -p\log_2 p - (1 - p)\log_2(1 - p)$ is the entropy of a binary random variable with probability p, and c is some finite positive constant independent of n and of the sequence $x^{(n)}$. In particular, we may compute this bound for the random mistake sequences that we are interested in. In section 6 we use this as a comparison with the empirically estimated algorithmic complexity, which is obtained by compression. We proceed to describe the setup.
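
As an illustration of how the bound (4.1) can be set against a compression-based estimate, the following is a minimal Python sketch. It is not the code used in the paper: the function names are hypothetical, the unknown constant c is set to 0 by default, zlib's DEFLATE is used as a stand-in compressor, and the sequence is assumed to be stored as one ASCII character per bit.

```python
import math
import zlib


def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_bound_bits(bits, c=0.0):
    """Right-hand side of (4.1): n H(p_bar) + (1/2) log2 n + c."""
    n = len(bits)
    p_bar = sum(bits) / n          # empirical frequency of 1s
    return n * binary_entropy(p_bar) + 0.5 * math.log2(n) + c


def compressed_length_bits(bits):
    """Length in bits of the compressed sequence.

    Assumption: the sequence is written as ASCII '0'/'1' characters
    and compressed with zlib (DEFLATE) at the highest level.
    """
    raw = "".join(str(b) for b in bits).encode("ascii")
    return 8 * len(zlib.compress(raw, 9))


constant_sequence = [0] * 1024
print(entropy_bound_bits(constant_sequence))      # bound reduces to (1/2) log2 n
print(compressed_length_bits(constant_sequence))  # small, but includes format overhead
```

For short sequences the fixed overhead of the compressed format can dominate, so the compressed length may exceed the bound even for very regular sequences.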



5. Experimental setup

The learning problem consists of predicting the next bit in a given sequence generated by a Markov chain (model) $\mathcal{M}^*$ of order $k^*$. There are $2^{k^*}$ states in the model, each represented by a word of $k^*$ bits. During a learning problem, the source's model is fixed. A learner, unaware of the source's model, has a Markov model of order k. We denote by $p(1|i)$ the probability of transiting from state i, whose binary k-word is $b_i = [b_i(1), \ldots, b_i(k)]$, to the state whose word is $[b_i(2), \ldots, b_i(k), 1]$. Given a random sequence of length m generated by the source, the learner estimates its own model's parameters $p(1|i)$ by $\hat{p}(1|i)$, $1 \le i \le 2^k$, which is the frequency of the event "$b_i$ is followed by a 1" in the training sequence. We denote by $\mathcal{M}$ the learnt model with parameters $\hat{p}(1|i)$, $1 \le i \le 2^k$. We denote by $p^*(1|i)$ the transition probability from state i of the source model, $1 \le i \le 2^{k^*}$.

A simulation run is characterized by the parameters k and m. It consists of a training phase and a testing phase. In the training phase we show the learner a binary sequence of length m and he estimates the transition probabilities. In the testing phase we show the learner another random sequence (generated by the same source) of length n and test the learner's predictions on it. For each bit in the test sequence we record whether the learner has made a mistake. When a mistake occurs we indicate this by a 1 and when there is no mistake we write a 0. The resulting sequence of length n is the generalization mistake sequence $\xi^{(n)}$. We denote by $\xi_0^{(n)}$ the binary subsequence of $\xi^{(n)}$ that corresponds to the mistakes that occurred only when the learner predicted a 0. Its length is denoted by $n_0$. We denote by $p_0$ the probability of mistake when predicting a 0, i.e., $p_0$ is the probability of seeing a 1 in the subsequence $\xi_0^{(n)}$.
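
The training and testing phases can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' program: the function names are hypothetical, a state is encoded as the integer whose binary expansion is the k-bit word (oldest bit first), states never seen in training get the estimate 0.5, a tie at 1/2 leads to predicting 0, and the first k test bits (which lack a full context) are skipped.

```python
def estimate_transition_probabilities(x_train, k):
    """Return the list of frequencies of the event 'state i is followed by a 1'."""
    mask = (1 << k) - 1
    ones = [0] * (1 << k)
    total = [0] * (1 << k)
    state = 0
    for t, bit in enumerate(x_train):
        if t >= k:                       # state now holds the previous k bits
            total[state] += 1
            ones[state] += bit
        state = ((state << 1) | bit) & mask
    # Assumption: unseen states default to 0.5.
    return [ones[s] / total[s] if total[s] else 0.5 for s in range(1 << k)]


def mistake_sequences(x_test, p_hat, k):
    """Return (xi, xi0): all mistakes, and mistakes made when predicting 0."""
    mask = (1 << k) - 1
    state = 0
    xi, xi0 = [], []
    for t, bit in enumerate(x_test):
        if t >= k:
            prediction = 1 if p_hat[state] > 0.5 else 0   # MAP decision
            mistake = 1 if prediction != bit else 0
            xi.append(mistake)
            if prediction == 0:
                xi0.append(mistake)
        state = ((state << 1) | bit) & mask
    return xi, xi0


# Tiny deterministic example with an order-1 learner.
p_hat = estimate_transition_probabilities([0, 1] * 50, k=1)
xi, xi0 = mistake_sequences([0, 1, 0, 1, 1, 1], p_hat, k=1)
print(p_hat, xi, xi0)
```

Here the length of `xi0` plays the role of $n_0$, and the fraction of 1s in `xi0` is the empirical counterpart of $p_0$.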

For a fixed k denote by $N_{k,m}$ the number of runs with a learner of order k and training sample of size m. The experimental setup consists of $N_{k,m} = 10$ runs with $1 \le k \le 10$, $m \in \{100, 200, \ldots, 10000\}$, for a total of $100 \cdot 10 \cdot N_{k,m} = 10000$ runs. The testing sequence is of length n = 1000. Each run results in a file called system which contains a binary vector d whose $i$th bit represents the maximum a posteriori decision made at state i of the learner's model, i.e.,

$$d_i = \begin{cases} 1 & \text{if } \hat{p}(1|i) > 1/2 \\ 0 & \text{otherwise} \end{cases} \qquad (5.1)$$

for $1 \le i \le 2^k$. Let us denote $a_i = P(\hat{p}(1|i) > 1/2)$; thus the $d_i$ are Bernoulli random variables with parameters $a_i$, $1 \le i \le 2^k$. The learner's system is comprised of the decision at every possible state.

Another file generated is errorTO, which contains the mistake subsequence $\xi_0^{(n)}$. At the end of each run we measure the length of the system file and its compressed length, where compression is obtained either via the Gzip algorithm (a variant of [30]) or the PPM algorithm [8], and compute the sysRatio (denoted as $\rho$), which is the ratio of the compressed to uncompressed length of the system file. Note that $\rho$ is a measure of information density since it captures the number of bits of useful information (useful for describing the system) per bit of representation (in the uncompressed file).
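
A minimal Python sketch of the sysRatio computation is given below. It rests on assumptions that the text does not fix: the decision vector d is assumed to be written as one ASCII character per entry, and zlib's DEFLATE stands in for Gzip (the gzip file format wraps the same DEFLATE stream with a larger header); PPM is not available in the Python standard library and is not shown. The function name `sys_ratio` is hypothetical.

```python
import zlib


def sys_ratio(decision_vector):
    """Ratio of compressed to uncompressed length of the system file."""
    # Assumption: one ASCII byte ('0' or '1') per component of d.
    raw = "".join(str(d) for d in decision_vector).encode("ascii")
    compressed = zlib.compress(raw, 9)
    return len(compressed) / len(raw)


print(sys_ratio([0] * 1024))       # highly regular rule: ratio far below 1
print(sys_ratio([0, 1, 1, 0]))     # tiny file: format overhead pushes the ratio above 1
```

The second call illustrates why the ratio can exceed unity for very short system files, since the fixed cost of the compressed format is then larger than the data itself.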

We do similarly for the mistake subsequence, obtaining the length $\ell_0$ of the compressed file that contains $\xi_0^{(n)}$ (henceforth referred to as the estimated algorithmic complexity of $\xi_0^{(n)}$ since it is an approximation of the Kolmogorov complexity of $\xi_0^{(n)}$, see [?]). We measure the KL-divergence $\Delta_0$ between the probability distribution $P(w|p_0)$ of binary words w of length 4 and the empirical probability distribution $P_m(w)$ as measured from the mistake subsequence $\xi_0^{(n)}$. Note, $P(w|p_0)$ is defined according to the Bernoulli model with parameter $p_0$, that is, $P(w|p_0) = p_0^i (1 - p_0)^{4-i}$ for a word w with i ones, where $p_0$ is the frequency of ones in the subsequence $\xi_0^{(n)}$. The distribution $P_m(w)$ equals the frequency of a word w in $\xi_0^{(n)}$. Hence $\Delta_0$ reflects by how much $\xi_0^{(n)}$ deviates from being random according to a Bernoulli sequence with parameter $p_0$ (the mistake probability when predicting a 0).
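
The divergence computation can be sketched in Python as follows. Two choices here are assumptions, since the text does not specify them: words are taken as consecutive non-overlapping blocks of 4 bits, and the divergence is computed in the direction $D(P_m \| P(\cdot|p_0))$, which stays finite when some words are never observed. The function name `word_divergence` is hypothetical.

```python
import math
from collections import Counter


def word_divergence(xi0, word_len=4):
    """KL divergence (in bits) between the empirical word distribution of xi0
    and the Bernoulli(p0) word distribution, with p0 the frequency of 1s in xi0."""
    # Assumption: consecutive, non-overlapping words; a short tail is ignored.
    words = [tuple(xi0[j:j + word_len])
             for j in range(0, len(xi0) - word_len + 1, word_len)]
    if not words:
        raise ValueError("sequence is shorter than one word")
    p0 = sum(xi0) / len(xi0)
    counts = Counter(words)
    divergence = 0.0
    for word, count in counts.items():
        empirical = count / len(words)                       # P_m(w)
        ones = sum(word)
        model = p0 ** ones * (1 - p0) ** (word_len - ones)   # P(w | p0)
        divergence += empirical * math.log2(empirical / model)
    return divergence


# A strongly patterned sequence: only the words 0000 and 1111 occur.
print(word_divergence([0, 0, 0, 0, 1, 1, 1, 1] * 10))
```

A sequence that behaves like a Bernoulli sequence with parameter $p_0$ would give a value close to 0, while patterned sequences such as the one above give a large value.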

6. Results 

We are interested in determining the relationship between the estimated algorithmic complexity $\ell_0$ of $\xi_0^{(n)}$, its divergence $\Delta_0$ and the learning performance. As the learning performance we look at the generalization error of type 0, that is, the error for 0-predictions. We choose four different levels of learning problems, controlled by the order of the source model $k^* = 3, 4, 5, 6$. For each problem we choose for the source model a transition matrix of probabilities $p^*(1|i) = 1 - p$, $p^*(0|i) = p$, where for some of the states i we set p = 0.3 and for others p = 0.7, $1 \le i \le 2^{k^*}$. Thus the Bayes optimal error is 0.3. To ensure that the problem is sufficiently challenging we set the first half of the states (those ranging from the $k^*$-dimensional vector 00...0 to 011...1) to have p = 0.3 and the second half (10...0 to 11...1) to have p = 0.7. This ensures that a Markov model of order $k < k^*$ cannot approximate the true transition probabilities well. That is, the infinite-sample limit estimate based on a Markov model of order k which is smaller than $k^*$ will still be $\hat{p}(1|i) = 0.5$, $1 \le i \le 2^k$. But for a Markov model of order $k \ge k^*$ the infinite-sample estimates will converge to the true values of p or 1 - p.
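
The effect of this design can be checked numerically with a short Python sketch. It is an illustration under assumptions and not the simulation code of the paper: function names are hypothetical, states are encoded as integers with the oldest bit first (so the "first half" of the states are those whose oldest bit is 0), and the initial state is drawn uniformly at random.

```python
import random


def source_probabilities(k_star, p_first=0.3, p_second=0.7):
    """p*(1|i) = 1 - p, with p = 0.3 on the first half of the states
    and p = 0.7 on the second half."""
    half = 1 << (k_star - 1)
    return [1 - p_first if i < half else 1 - p_second
            for i in range(1 << k_star)]


def generate(p1, k_star, length, rng):
    """Generate bits from the order-k* Markov source with P(1 | state) = p1[state]."""
    mask = (1 << k_star) - 1
    state = rng.getrandbits(k_star)      # assumption: uniform random start state
    out = []
    for _ in range(length):
        bit = 1 if rng.random() < p1[state] else 0
        out.append(bit)
        state = ((state << 1) | bit) & mask
    return out


def next_bit_frequencies(x, k):
    """Frequency of a 1 following each k-bit context (0.5 if a context never occurs)."""
    mask = (1 << k) - 1
    ones, total = [0] * (1 << k), [0] * (1 << k)
    state = 0
    for t, bit in enumerate(x):
        if t >= k:
            total[state] += 1
            ones[state] += bit
        state = ((state << 1) | bit) & mask
    return [ones[s] / total[s] if total[s] else 0.5 for s in range(1 << k)]


rng = random.Random(0)
x = generate(source_probabilities(3), 3, 100000, rng)
print(next_bit_frequencies(x, 1))   # order 1 < k* = 3: estimates stay near 0.5
print(next_bit_frequencies(x, 3))   # order 3 = k*: estimates approach 0.7 and 0.3
```

With this construction the next bit depends only on the bit $k^*$ steps back, so any context of fewer than $k^*$ bits carries no information about it.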

6.1. Learning curves. Before we start to investigate the three relationships stated above we perform a sanity check to see how the prediction generalization error (for either of the two prediction types, not just when predicting a zero) varies with respect to the model complexity k and training length m. These are the so-called 'learning curves' in the areas of statistical pattern recognition and learning theory [1]. Figure 6.1 displays the contours of the error surface as a function of k and m for a learning problem with $k^* = 5$ (the Bayes error is 0.3). As can be seen, when $k < k^*$ the error remains very high, close to 0.5, regardless of the training sample size m (this is the leftmost contour colored in red). For $k \ge k^*$ the prediction error gets closer to the Bayes value of 0.3 (outermost contour colored in dark blue) with increasing m. The shape of the contours indicates the tradeoff between approximation and estimation errors whose sum is the prediction error (standard results from learning theory, see for instance [27], [?]). The larger k becomes, the lower the approximation error. The larger m becomes, the smaller the estimation error.

Figure 6.1. Generalization error with respect to k and m for $k^* = 5$

We now proceed to describe the main result, which concerns the relationship between the learner's performance and the mistake sequence complexity.

6.2. sysRatio $\rho$ versus k. First we look at the relationship between the sysRatio $\rho$ and k. Figure 6.2 shows the average of the sysRatio $\rho$ as a function of k, where in Figure 6.2(A) we used Gzip as the compressor that estimates the Kolmogorov complexity and in Figure 6.2(B) we used the PPM algorithm as compressor. Note that the PPM compressor obtains $\rho$ values that are smaller than those of the Gzip compressor, which means that the compressed lengths of the corresponding system files are smaller when using PPM. We believe that this is due to the additional cost incurred by Gzip in the form of data structures that are appended to the compressed data. This is more noticeable when the file to be compressed is small (for instance, in the plot we see that the sysRatio only goes below unity at around k = 4, which is when the uncompressed file length goes above 16). The PPM compressor thus approximates the algorithmic (Kolmogorov) complexity better than Gzip when the uncompressed files are relatively small. In the remainder of the paper we keep the plots with respect to both types of compressors in order to show that the results of our analysis do not significantly vary as one changes from one type of compressor to another (in some places we show only the Gzip-based results since the differences were insignificant).

Looking at the plots of Figure 6.2 it is clear that the average sysRatio decreases as the learner's model order k increases. For the PPM compressor, we see a critical point in the vicinity of $k^*$ where the curvature of the graph changes from concave down to concave up, possibly indicating an inflection point (this holds for learning problems with other values of $k^*$; for instance in Appendix A we show this for $k^* = 3$ and $k^* = 7$). To explain this, first note that the uncompressed length of the system file is always $c \cdot 2^k$ for some constant $c > 0$ since the vector d is of length $2^k$ (see section 5). The length of the compressed system file also grows, but at a slower rate with respect to k, and this gives rise to the decrease in $\rho$ with respect to k. We can explain why the compressed length of the system file grows more slowly as follows: for values of $k < k^*$ the learner's model is incapable (by design of the learning problem) of estimating the Bayes optimal prediction and the probability of the event "$b_i$ is followed by a 1" is $p(1|i) = 1/2$, $1 \le i \le 2^k$. Thus the average value $\hat{p}(1|i)$ of the indicators of such events is a Binomial random variable with a distribution symmetric about 1/2 and hence from (5.1) the probability $a_i$ that $\hat{p}(1|i) > 1/2$ equals 1/2. The components of the random vector d are independent Bernoulli random variables with parameter $a_i$ when conditioned on the sample size vector v (this is the vector whose components $v_i$ are the number of times that $b_i$ appeared in the training sequence, see [?] for details). Since in this case $a_i = 1/2$, each component has maximum entropy $H(d_i) = -a_i \log_2 a_i - (1 - a_i)\log_2(1 - a_i) = \log_2 2 = 1$ and hence the expected value of the entropy of the vector d (with respect to the random sample size vector v) is maximal and equals $E_v H(d|v) = E_v \sum_{i=1}^{2^k} H(d_i|v_i) = E_v 2^k = 2^k$. Hence the expected compressed length of the system file (which contains the vector d) is large, as the expected description length of any random variable is at least as large as its entropy.

As k increases beyond $k^*$ the model becomes more capable of estimating the true transition probabilities (recall, these are either 0.3 or 0.7) and the probability $p(1|i)$ of the event "$b_i$ is followed by a 1" gets farther away from 1/2 in the direction of 0.3 or 0.7, depending on the particular state i, $1 \le i \le 2^k$. Thus the average value $\hat{p}(1|i)$ of the indicators of such events is a Binomial random variable with an asymmetric distribution whose mean differs from 1/2. Hence from (5.1) the probability $a_i$ that $\hat{p}(1|i) > 1/2$ gets either very close to 0 or 1 as the training size m increases. Thus the components of the random vector d tend to be closer to deterministic. They are still random since the training sequence length is not increasing with k and the variance of the estimates $\hat{p}(1|i)$ does not converge to zero. Therefore for each of the $2^k$ components of the vector d the entropy is smaller than when $k < k^*$. However, as there are exponentially many components $d_i$, on the whole the entropy of d (and hence the expected compressed length of the system file) still increases but at a lower rate than when $k < k^*$.

Figure 6.3. Generalization error with respect to $\rho$ and m (for $k^* = 5$)

We can now alternatively look at the learning curves (section 6.1) based on the sysRatio (instead of k). This is shown in Figure 6.3. Clearly, good learners are those with a low value of the sysRatio $\rho$ (left uppermost region, which is colored dark blue) while bad learners are those with a high sysRatio $\rho$, displayed as the rightmost contour which spans from the lowest to the highest m values.

We proceed now to discuss the characteristics of the mistake subsequence $\xi_0^{(n)}$. First, in section 6.3 we study how its estimated algorithmic complexity $\ell_0$ and divergence $\Delta_0$ depend on the learner's decision characteristics, or formally, the sysRatio $\rho$. In section 6.4 we fix the learner's model order k and study how $\ell_0$ depends on $\Delta_0$. Finally in sections 6.6 and 6.7 we study the $p_0$ and $\rho$ surfaces over the $\Delta\ell$-plane.

Figure 6.4. Estimated algorithmic complexity of the mistake subsequence $\xi_0^{(n)}$ versus the sysRatio (average)



6.3. Estimated algorithmic complexity $\ell_0$ and divergence $\Delta_0$ versus sysRatio $\rho$. Note, in the plots of this section we use the average sysRatio, which is computed by taking for each value of $1 \le k \le 10$ the average over the $N_{k,m} \cdot 100 = 1000$ runs. Figure 6.4 shows the graph (with x markers) of the average estimated algorithmic complexity $\ell_0$ of $\xi_0^{(n)}$ versus the average sysRatio $\rho$. The dashed lines are the upper and lower envelopes of the estimated standard deviation from the mean. This spread arises from the different values of training size m and from the fact that both the training and test sequences are random. The arrow points at the value $\rho^*$ that corresponds to $k^* = 5$ (the source model order). As can be seen, for low values of $\rho$ the spread in $\ell_0$ is low. There is a critical point at $\rho^*$ where the spread around the mean value of $\ell_0$ increases significantly as $\rho$ increases.

We know from section 3 that the higher the algorithmic complexity of a selection
rule the higher the possible deviation of the frequency of 1s in the selected
subsequence (the stochastic deviation). As mentioned above, in [19] it was shown that
the decision rule of a learner can be represented as a selection rule that picks the
subsequence corresponding to the mistakes made when predicting 0s in the input
test sequence. The theory predicts that the stochastic deviation of the mistake
sequence $\xi_0^{(n)}$ grows as the complexity of the decision rule increases. We now validate
this experimentally.



Figure 6.5 displays the graph (with x) of the average divergence $\Delta_0$ of the mistake
subsequence versus the average of the sysRatio $\rho$, where again averages are taken
over the 1000 runs as described above. The dashed lines are the upper and lower
envelopes of the standard deviation from the mean. The arrow points at the value
of $\rho^*$ that corresponds to $k^*$ (the source model order). As can be seen, for low
values of $\rho$ the spread of $\Delta_0$ is low. Similar to the previous result for $\ell_0$, here too
we see a relative minimum at $\rho^*$, and the standard deviation around the mean
value of $\Delta_0$ increases once we increase $\rho$ beyond $\rho^*$. Since we know there is an
inverse relationship between $\rho$ and $k$ (Figure 6.2), the small hook shape that
appears to the left of the plot in Figure 6.5 indicates an increase in the $\Delta_0$ value as
$k$ increases beyond $k^*$ ($\rho$ decreases below $\rho^*$). Thus data overfitting (which occurs
when $k > k^*$) is depicted here via this slight increase in the divergence $\Delta_0$ as we
decrease $\rho$ below $\rho^*$.

It follows from this result that the sysRatio $\rho$ (which is a measure of information
density of the learner's model [20]) influences how random the mistakes made by
a learner are. The sysRatio $\rho$ is a proper measure of complexity of a learner's decision
rule since it is with respect to $\rho$ that the characteristics of the random mistake
subsequence ^q"^ are consistent with the theory |19l [5T| , namely, the higher the 
sysRatio the more significant the deviation $\Delta_0$ of $\xi_0^{(n)}$ from a pure Bernoulli random
sequence.
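
As a minimal sketch of how an information density of this kind could be computed, assume the learner's decision rule is stored as a byte string and that zlib's DEFLATE stands in for the Gzip-based compressor. The function names `sys_ratio` and `rule_table`, and the encoding of the rule as one ASCII character per $k$-bit context, are illustrative assumptions and not the paper's implementation.

```python
import random
import zlib


def sys_ratio(rule_bytes: bytes) -> float:
    """Ratio of compressed length to uncompressed length of a decision-rule file."""
    if not rule_bytes:
        raise ValueError("decision rule must be non-empty")
    return len(zlib.compress(rule_bytes, 9)) / len(rule_bytes)


def rule_table(k: int, bit_for_context) -> bytes:
    """Hypothetical encoding: one ASCII '0' or '1' for each of the 2**k contexts."""
    bits = (str(bit_for_context(i)) for i in range(2 ** k))
    return "".join(bits).encode("ascii")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
k = 12

# A highly regular rule (always decide 0) versus an irregular one (coin flips).
constant_rule = rule_table(k, lambda i: 0)
noisy_rule = rule_table(k, lambda i: rng.randint(0, 1))

print(f"constant rule sysRatio: {sys_ratio(constant_rule):.3f}")
print(f"noisy rule sysRatio:    {sys_ratio(noisy_rule):.3f}")
```

The regular table compresses far better than the irregular one, so its ratio is smaller; the absolute values depend on the compressor and on the file encoding chosen here.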

We have so far considered $\rho$ as an independent variable. In section 6.7 we study
the sysRatio as a dependent variable, i.e., as a function of the estimated algorithmic
complexity $\ell_0$ and divergence $\Delta_0$. Before looking at that we proceed to show how $\ell_0$
varies with respect to the error $p_0$, which will now play the role of the independent
variable.

6.4. Estimated algorithmic complexity $\ell_0$ versus the error $p_0$ for different
values of $k$. We first mention that in all the figures below we reduced the number
of data points (using simple random sampling) for clarity of presentation. Figure
6.6 shows the estimated algorithmic complexity $\ell_0$ of the mistake subsequence $\xi_0^{(n)}$
versus the probability of error $p_0$. The curves are a second order regression. For
$k = 3 < k^*$ there is no clear relationship but for $k = 6$ (just above $k^*$) we see
a sharp rise in $\ell_0$ with respect to an increasing $p_0$ (the regression polynomial is
$-448x^2 + 396x + 47$). When $k = 10$ (double the value of $k^*$) we see a less steep
increase (the regression polynomial is $-356x^2 + 325x + 60$).
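
The two regression polynomials quoted above can be evaluated directly to compare the fitted complexity and its rate of change at a given error level. The sketch below is only a worked example: the helper `quadratic` and the error values at which the fits are evaluated are chosen for illustration and are not taken from the experiments.

```python
def quadratic(a: float, b: float, c: float):
    """Return (value, slope) functions for the fit a*x^2 + b*x + c."""
    def value(x: float) -> float:
        return a * x * x + b * x + c

    def slope(x: float) -> float:
        # Derivative of the quadratic: 2*a*x + b.
        return 2.0 * a * x + b

    return value, slope


# Coefficients as reported in the text for k = 6 and k = 10.
value_k6, slope_k6 = quadratic(-448.0, 396.0, 47.0)
value_k10, slope_k10 = quadratic(-356.0, 325.0, 60.0)

# Illustrative error probabilities (an assumption, not the plotted range).
for p0 in (0.30, 0.36, 0.40, 0.50):
    print(
        f"p0={p0:.2f}  "
        f"k=6: value={value_k6(p0):7.2f} slope={slope_k6(p0):7.2f}  "
        f"k=10: value={value_k10(p0):7.2f} slope={slope_k10(p0):7.2f}"
    )
```

Because both fits are concave, their slopes shrink as $p_0$ grows, so any comparison of steepness depends on the error range over which the curves are read.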

6.5. Estimated algorithmic complexity $\ell_0$ and divergence $\Delta_0$ versus error
$p_0$ over full range of $m$ and $k$. In Figure 6.7(a) we compare $\ell_0$, marked in red
(x), to the entropy-based estimate of (4.1), marked in blue (+), where we substitute
for $n$ in (4.1) the length of the sequence $\xi_0^{(n)}$ and the probability $p_0$ for the
parameter $p$. The value of Pearson's correlation coefficient between $\ell_0$ and the
entropy estimate is 0.925, indicating a high correlation (almost linear). Thus the
entropy-based estimate appears to be good for the whole population of learners,
which consists of training sequences of size $m = 100, 200, \ldots, 10{,}000$ and models of
order $k = 1, 2, \ldots, 10$. In Figure 6.7(a), for the data (marked by x) there appear
to be two clusters of points (sequences) separated by an error probability gap at
$p_0 \approx 0.45$. The first region is for $p_0 < 0.38$. We refer to it as the cool cluster. Here
the complexity $\ell_0$ values are concentrated. The other cluster (termed hot) is where
$p_0 > 0.45$. Here the spread in values of $\ell_0$ is significantly larger than in the cool
cluster.
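
The comparison between a compression-based complexity estimate and an entropy-based one can be sketched as follows. Since equation (4.1) is not reproduced in this section, the sketch assumes the simplest entropy-based form, $n H(p)$ with $H$ the binary entropy; zlib stands in for the Gzip-based compressor, and the sequences are simulated Bernoulli sequences rather than the paper's mistake sequences. All function names are hypothetical.

```python
import math
import random
import zlib


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits; defined as 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def entropy_estimate(n: int, p: float) -> float:
    # ASSUMPTION: n * H(p) stands in for the estimate (4.1), which is not shown here.
    return n * binary_entropy(p)


def compressed_bits(bits: list[int]) -> int:
    """Compression-based complexity estimate: compressed length in bits."""
    text = "".join(map(str, bits)).encode("ascii")
    return 8 * len(zlib.compress(text, 9))


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed for reproducibility
complexities, estimates = [], []
for _ in range(200):
    n = rng.randint(200, 4000)       # simulated sequence length
    p = rng.uniform(0.3, 0.5)        # simulated probability of a 1
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    p_hat = sum(bits) / n            # empirical frequency of 1s
    complexities.append(float(compressed_bits(bits)))
    estimates.append(entropy_estimate(n, p_hat))

r = pearson(complexities, estimates)
print(f"Pearson correlation on simulated sequences: {r:.3f}")
```

The correlation printed here describes the simulated sequences only and is not comparable to the 0.925 reported for the experimental population.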
Figure 6.6. Estimated algorithmic complexity $\ell_0$ versus error
probability, for $k = 3$, 6 and 10 and $k^* = 5$, Gzip-based
compressor



In Figure 6.7(b) we see that the divergence $\Delta_0$ (marked by the symbols o) and
the complexity $\ell_0$ (marked by x) are somewhat correlated (Pearson's coefficient of
0.241). This is due to the fact that the divergence values $\Delta_0$ are also split into two
clusters which are in correspondence with the two clusters of the $\ell_0$ values.

Let us look at the distribution of $\ell_0$, which is shown in Figure 6.8. The distribution
is very similar for both types of compressors. For the Gzip-based and PPM-based
compressors the mean values are $\mu_0 = 129$ and $109$, and the distributions have skewness
of 1.79 and 1.8 and kurtosis of 2.39 and 2.66, respectively (for the normal distribution the
skewness and kurtosis are 0). This indicates that the distributions are positively
asymmetric (a heavier right tail) and peaked.
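
For reference, the skewness and kurtosis quoted above can be computed from sample moments as sketched below. The kurtosis is taken in its excess form (raw kurtosis minus 3), which is the convention under which the normal distribution scores 0, as the text states. The function name and the two small samples are illustrative assumptions, not the paper's data.

```python
def shape_statistics(xs: list[float]) -> tuple[float, float, float]:
    """Return (mean, skewness, excess kurtosis) from central sample moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # variance (population form)
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0        # 0 for a normal distribution
    return mean, skewness, excess_kurtosis


# Toy samples for illustration only.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0]

for name, sample in (("symmetric", symmetric), ("right-tailed", right_tailed)):
    mean, skew, kurt = shape_statistics(sample)
    print(f"{name:13s} mean={mean:.2f} skewness={skew:.2f} excess kurtosis={kurt:.2f}")
```

A positive skewness signals a heavier right tail and a positive excess kurtosis a more peaked, heavier-tailed shape than the normal distribution, which is the reading applied to the $\ell_0$ histogram.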

6.6. The error $p_0$ surface. Figure 6.9 depicts the first central result of the
paper. It displays the error probability $p_0$ as a function of the divergence $\Delta_0$ and
estimated algorithmic complexity $\ell_0$ (we note that the jagged contour lines are due

Figure 6.7. Estimated algorithmic complexity $\ell_0$ and divergence
$\Delta_0$ as function of $p_0$, Gzip-based compressor, for population of
error sequences based on learners with $m = 100, 200, \ldots, 10{,}000$
and models of order $k = 1, 2, \ldots, 10$.

Figure 6.8. Histogram of $\ell_0$ for population of error sequences
based on learners with $m = 100, 200, \ldots, 10{,}000$ and models of
order $k = 1, 2, \ldots, 10$.



to the interpolation mesh being limited in size and do not reflect actual data). At
the center bottom we see the contour level of 0.356 (this is approximately the Bayes
error level) and the topmost contour is at a value of 0.518, which corresponds to
prediction by pure guessing. We can ascertain the following from this interesting
plot: the population of mistake sequences of lowest error probability (close to the
Bayes 0.3 value) concentrates close to the mean value $\mu_0$ and has a very low
divergence $\Delta_0$. This region corresponds to the cool cluster of Figure 6.7 (we call it
the cool region and it appears in blue in Figure 6.9). This characteristic indicates
that the sequences in the cool region are close to being truly random Bernoulli
sequences with parameter $p_0$. As we start to look at a population of sequences
with a higher error probability $p_0$ and walk along its fixed contour level we have a
tradeoff between two possible choices: (1) to have a complexity $\ell_0$ value which is






JOEL RATSABY 




Figure 6.9. Error probability $p_0$ as a function of divergence $\Delta_0$
and estimated algorithmic complexity $\ell_0$: (a) Gzip-based compressor,
(b) PPM-based compressor



far from the mean (less than or greater than $\mu_0$) and maintain a low divergence $\Delta_0$
value or (2) to have a large divergence $\Delta_0$ and maintain an $\ell_0$ which is close to the
mean $\mu_0$. The union of the red and orange regions in Figure 6.9 corresponds to the
hot cluster that we saw in Figure 6.7. By definition of the maximum a posteriori
probability decision rule that we are using, the true error can never exceed 0.5, so the
true error surface cannot exceed 0.5, and this is why the empirical error
surface ends at a contour level close to 0.5.

An interesting point that we see here is that this surface is defined only over a
part (colored region) of the $\Delta\ell$-plane. We term this the admissible region of the
$\Delta\ell$-plane and it is induced by the error surface. In Figure 6.9 we see that the
contour area is slightly larger on the right side of $\mu_0$ than on the left of $\mu_0$, which
is consistent with the heavier right tail of the $\ell_0$ distribution in Figure 6.8. So
admissibility appears to have a slight intrinsic bias towards complexity values $\ell_0$
that are larger than the mean $\mu_0$.

If we regard sequences in the cool region as truly random (i.e., having a
complexity value $\ell_0$ close to the mean $\mu_0$ and a low divergence from Bernoulli)
then we can introduce a new perspective on the process of learning. When the
process is perfect, it produces a Bayes optimal predictor whose mistake sequence
falls in the cool region. But when it is imperfect (due to limited training size $m$ or
improper model order $k$) the process produces a malformed sequence which is either
atypically chaotic ($\ell_0$ far from $\mu_0$) but stochastic (low $\Delta_0$) or typically chaotic ($\ell_0$
close to $\mu_0$) but atypically stochastic (large $\Delta_0$).

So far we discussed the error surface, which is intrinsically a property of the
random mistake sequence since $p_0$ is defined only based on the ratio of the number of
1s to the length of the sequence. In the next section we examine the sysRatio surface,
which is intrinsically a learner's characteristic since it measures the information
density $\rho$ of the learner's decision rule.

6.7. The sysRatio $\rho$ surface. Figure 6.10 displays the next central result of the
paper, a contour plot of the sysRatio $\rho$ over the $\Delta\ell$-plane. The outer contours (red)
are for higher values of $\rho$. There are two relative minima, one of which is at a lower
value of $\Delta_0$ and touches the $\Delta_0 = 0$ axis while the other appears above the 0.04
divergence level. Based on what we already know about $\rho$ versus $k$ (Figure 6.2)

Figure 6.10. sysRatio $\rho$ as a function of divergence $\Delta_0$ and
estimated algorithmic complexity $\ell_0$: (a) Gzip-based compressor,
(b) PPM-based compressor



we can conclude that the lower minimum in Figure 6.10 is in a region of the plane
that corresponds to sequences generated by learners of order $k$ which is equal to $k^*$
or just slightly above $k^*$ (we call this region OM for 'overfitting minimum') while
the upper relative minimum in Figure 6.10 is in the region of sequences generated
by learners of order $k$ which is slightly lower than $k^*$ (we call this region UM for
'underfitting minimum'). The remaining regions (colored green to red) are where
the learners have an order $k$ significantly less than $k^*$. Thus there is a saddle point
as one passes from UM to OM and crosses from $k$ just under $k^*$ to $k = k^*$.
This is more pronounced in the Gzip-based compressor than in the PPM-based
compressor.

Based on this plot we can see that a decision rule with a high information density
(sysRatio value $\rho$) yields an atypically chaotic random error sequence, i.e., with
an estimated algorithmic complexity value $\ell_0$ that is far from the mean $\mu_0$. As
the information density of the decision rule decreases the complexity of the error
sequence moves towards a typical value ($\ell_0$ closer to $\mu_0$) and its divergence from
Bernoulli decreases towards zero.

Recall from the end of section 2 that the act of predicting bits of the input test
sequence $x^{(n)}$ to be 0s is equivalent to selecting from $x^{(n)}$ a subsequence $\xi_0^{(n)}$. We
are now in a position to understand that this selection process produces random
binary sequences $\xi_0^{(n)}$ of different character and 'spreads' them in different regions
of the $\Delta\ell$-plane. This spreading is a consequence of what we term scattering bits
of a sequence since it resembles particle scattering in physics (it is also similar
to the concept of chaotic scattering [17] where instead of initial conditions of the
learner we characterize it by its information density $\rho$). Given a random input
sequence $x^{(n)}$ the learner (in his decision/selection action) effectively scatters the
bits of $x^{(n)}$ in a way that resembles the binary collisions of particles in a beam with
other particles that knock the beam particles into different directions. From this
scattering the resulting sequence of bits is $\xi_0^{(n)}$. The learner here acts as a static
structure (a solid of some kind), or a localized target such as a thin foil in a physical
scattering experiment. Learners with high information density $\rho$ scatter bits of the
input sequence more wildly, thereby producing sequences (points in the $\Delta\ell$-plane)
that deviate from typical complexity values or have high stochastic divergence. As
mentioned in section 1 this is in line with the model introduced in [18] where a static
structure is said to deform the randomness characteristics of an input sequence of
excitations.
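
The selection view described above can be sketched in a few lines: wherever the decision rule predicts a 0, the actual bit of the test sequence is copied into the selected subsequence, so a 1 in that subsequence marks a mistake. The function `mistake_subsequence`, the representation of the rule as a function of the previous $k$ bits, and the toy input are illustrative assumptions rather than the paper's experimental setup.

```python
from typing import Callable, Sequence


def mistake_subsequence(
    x: Sequence[int],
    predict: Callable[[tuple], int],
    k: int,
) -> list[int]:
    """Select x[t] at every position t >= k where the rule predicts 0.

    `predict` maps the previous k bits (the context) to a predicted bit.
    """
    selected = []
    for t in range(k, len(x)):
        context = tuple(x[t - k:t])
        if predict(context) == 0:
            selected.append(x[t])  # a 1 here is a mistake on a predicted 0
    return selected


def error_frequency(subsequence: Sequence[int]) -> float:
    """Fraction of 1s in the selected subsequence (empirical error on predicted 0s)."""
    return sum(subsequence) / len(subsequence)


# Toy test sequence and a hypothetical order-1 rule that repeats the last bit.
x = [0, 1, 0, 0, 1, 1, 0]
repeat_last = lambda context: context[-1]

xi0 = mistake_subsequence(x, repeat_last, k=1)
print("selected subsequence:", xi0)
print("fraction of 1s:", error_frequency(xi0))
```

A rule that always predicts 0 selects every bit after the first $k$, whereas a more intricate rule picks out an irregular subset of positions; this is the sense in which the rule "scatters" the bits of the input sequence.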

It is interesting to ask at this point whether, as a consequence of this phenomenon,
it may perhaps be possible to optimally fine-tune a learner's model order $k$ just by
observing the randomness characteristics of the mistake sequence $\xi_0^{(n)}$, i.e., adjusting $k$
in a direction that corresponds to decreasing $\rho$ towards the OM region. It is not
yet clear whether such a scheme that monitors the random characteristics of the
mistake sequence would yield better performance (either accuracy or computational
efficiency) compared to doing standard model selection, which adjusts $k$ directly
based on some form of estimate of the generalization error [10].



7. Conclusions 

This paper is an experimental investigation of the problem that was posed and 
theoretically solved in \IM . We have reconfirmed that the sysRatio p originally 
introduced in is a proper measure of the complexity of a learner's decision 

rule as it is with respect to $\rho$ that the deformation of randomness of the mistake
subsequence $\xi_0^{(n)}$ takes place, consistent with the theory, namely, the higher
the value of $\rho$ the more significant the divergence $\Delta_0$ of the mistake sequence $\xi_0^{(n)}$
relative to a pure Bernoulli random sequence. The two central results introduced
in the current paper depict the special structure of the error probability $p_0$ and
sysRatio $\rho$ surfaces over the $\Delta\ell$-plane. They imply that bad learners generate
atypically complex or stochastically divergent mistake sequences while good learners
generate typically complex sequences with low divergence from Bernoulli. Since a
learner can be modeled as a selection rule we name this phenomenon 'bit-scattering'.
The idea follows the general model of static algorithmic interference introduced in
[18] whereby effectively the learner acts as a static structure whose complexity is the
sysRatio (information density $\rho$). It produces randomly-deformed types of mistake
sequences where deformation is proportional to $\rho$.



Appendix A. 

In this section we present some additional auxiliary results pertaining to the
relationship between the sysRatio $\rho$ and model order $k$. In section 6, for a learning
problem with $k^* = 5$, we saw that for the PPM-based compressor the graph of the
average $\rho$ versus $k$ is decreasing and has a critical point in the vicinity of $k^*$. Figure
A.1 shows that this critical point also appears in learning problems with $k^* = 3$.
For $k^* = 7$ there appear to be two critical points, one of which is at $k^*$.